A common question an analyst faces is whether an observed effect or difference in the data is real or merely due to chance.
To answer it, they define a null hypothesis, which supposes that the observed difference/effect is due to randomness alone, with the alternative hypothesis being the opposite. A common workflow is to compute a test statistic that summarizes the difference/effect in the data. From the test statistic a p-value is calculated, and if the p-value falls below a chosen threshold, the null hypothesis can be rejected.
A significant pain point is selecting the appropriate test and ensuring it works for your data and use case. There are dozens of different tests, all with different assumptions and nuances.
A computer scientist named Allen Downey coined the now famous “There is only one test” framework, in which essentially all common hypothesis tests can be operationalized the same way: compute the observed statistic, simulate a world where the null hypothesis is true, and see how often the simulated statistic is as extreme as the observed one.
This common workflow underpins the functions in the infer package. It is designed around the 4 main tasks an analyst performs when conducting common statistical hypothesis tests:
specify(): specify the variable(s) and relationship between them
hypothesize(): define null hypothesis
generate(): generate data reflecting the null hypothesis
calculate(): calculate a distribution of statistics from the generated data to form the null distribution
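Chained together with %>%, the four steps form one pipeline. As a schematic sketch (the data frame, variables, and statistic below are placeholders to fill in for a concrete test):

```r
library(infer)

# Schematic only -- substitute a real data frame, variables, and statistic:
# your_data %>%
#   specify(response ~ predictor) %>%            # 1. declare the variables
#   hypothesize(null = "independence") %>%       # 2. state the null hypothesis
#   generate(reps = 1000, type = "permute") %>%  # 3. simulate the null world
#   calculate(stat = "diff in means")            # 4. build the null distribution
```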
Common workflow for common hypothesis testing
To install the package:
install.packages("infer")
We’ll also need:
library(tidyverse)
library(infer)
library(palmerpenguins)
library(ggridges)
library(broom)
library(skimr)
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
penguins_sans_chin <- penguins %>% filter(species != "Chinstrap")
penguins_sans_chin %>%
ggplot(aes(x = species, y = body_mass_g, fill = species)) +
geom_boxplot() +
scale_fill_brewer(palette = 3) +
# scale_y_continuous(breaks = seq(1, 10, 1)) +
labs(
title = "Body mass by species",
x = NULL, y = "Body mass (g)") +
theme_minimal()
penguins_sans_chin %>%
ggplot(aes(x = body_mass_g, fill = species)) +
geom_histogram(binwidth = 100, color = "white") +
scale_fill_brewer(palette = 3, guide = FALSE) +
labs(y = "Count", x = "Body mass (g)") +
facet_wrap(~ species) +
theme_minimal() +
theme(panel.grid.major.x = element_blank())
penguins_sans_chin %>%
ggplot(aes(x = body_mass_g, y = fct_rev(species), fill = species)) +
stat_density_ridges(quantile_lines = TRUE, quantiles = 2, scale = 3, color = "white") +
scale_fill_brewer(palette = 3, guide = FALSE) +
# scale_x_continuous(breaks = seq(0, 10, 2)) +
labs(x = "Body mass (g)", y = NULL) +
theme_minimal()
penguins_sans_chin %>%
group_by(species) %>%
summarize(avg_body_mass = mean(body_mass_g, na.rm = TRUE))
## # A tibble: 2 × 2
## species avg_body_mass
## * <fct> <dbl>
## 1 Adelie 3701.
## 2 Gentoo 5076.
We will use an infer workflow to determine whether the difference in median body mass between the Adelie and Gentoo species is real or due to chance.
The first step is specify(), which indicates which variable is the outcome and which is the predictor.
The second step is calculate(), which computes the difference in median values between the two species:
diff <-
penguins_sans_chin %>%
specify(body_mass_g ~ species) %>%
calculate("diff in medians",
order = c("Gentoo", "Adelie")) # gives the order to subtract the median values in
diff
## Response: body_mass_g (numeric)
## Explanatory: species (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 1300
The observed difference in median body mass is +1300 grams for Gentoo relative to Adelie. How likely is it that a difference this large is due to chance alone? We test this by determining whether such a difference would plausibly appear in a simulated world where none of the variation in body mass is explained by species.
To do this, we use hypothesize() that the two species are independent of each other and that there’s no difference between the two, and then generate() to create a null distribution. Finally, we calculate() the difference in medians for the two species.
null <-
penguins_sans_chin %>%
specify(body_mass_g ~ species) %>%
hypothesize(null = "independence") %>%
generate(reps = 5000, type = "permute") %>%
calculate("diff in medians",
order = c("Gentoo", "Adelie"))
null %>%
visualize() +
geom_vline(xintercept = diff$stat, color = "red", size = 1) +
labs(x = "Difference in median body masses\n(Gentoo − Adelie)",
y = "Count",
subtitle = "Red line shows observed difference in median body masses") +
theme_minimal() +
theme(panel.grid.minor = element_blank())
Visually, we can see that the observed difference (red line) lies far outside the null distribution. We can use get_p_value() to be sure:
null %>%
get_p_value(obs_stat = diff, direction = "both") # "both" gives a two-tailed p-value
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
The p-value is so small that it displays as 0. It is not exactly zero: with 5,000 permutations, all we can say is that none of the simulated differences was as extreme as the observed one, i.e. p < 1/5000. A difference in median body mass this large essentially never occurs in the simulated world where species explains nothing, so we can safely infer that the difference in median body mass between the two species is significant.
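As an informal cross-check outside the infer workflow, base R's classical Wilcoxon rank-sum test on the same comparison points the same way. Note that droplevels() is needed because the filtered-out Chinstrap level otherwise remains on the species factor:

```r
# Classical rank-based test of the same comparison; broom::tidy()
# turns the htest object into a one-row tibble:
wilcox.test(body_mass_g ~ species,
            data = droplevels(penguins_sans_chin)) %>%
  broom::tidy()
```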
ANOVAs are used to analyze differences in group means. As an illustrative example, we’ll look at the relationship between penguin species and bill length.
penguins %>%
ggplot(aes(x = species, y = bill_length_mm, fill = species)) +
geom_boxplot() +
scale_fill_brewer(palette = 5) +
# scale_y_continuous(breaks = seq(1, 10, 1)) +
labs(
title = "Bill length by species",
x = NULL, y = "Bill length (mm)") +
theme_minimal()
penguins %>%
ggplot(aes(x = bill_length_mm, fill = species)) +
geom_histogram(binwidth = 5, color = "white") +
scale_fill_brewer(palette = 5, guide = FALSE) +
labs(y = "Count", x = "Bill length (mm)") +
facet_wrap(~ species) +
theme_minimal() +
theme(panel.grid.major.x = element_blank())
penguins %>%
ggplot(aes(x = bill_length_mm, y = fct_rev(species), fill = species)) +
stat_density_ridges(quantile_lines = TRUE, quantiles = 2, scale = 3, color = "white") +
scale_fill_brewer(palette = 5, guide = FALSE) +
# scale_x_continuous(breaks = seq(0, 10, 2)) +
labs(x = "Bill length (mm)", y = NULL) +
theme_minimal()
penguins %>%
group_by(species) %>%
summarize(avg_bill_length = mean(bill_length_mm, na.rm = TRUE))
## # A tibble: 3 × 2
## species avg_bill_length
## * <fct> <dbl>
## 1 Adelie 38.8
## 2 Chinstrap 48.8
## 3 Gentoo 47.5
Using the same functions as earlier, we can compute the 𝐹 statistic.
f_stat <-
penguins %>%
specify(bill_length_mm ~ species) %>%
hypothesize(null = "independence") %>%
calculate(stat = "F")
f_stat
## Response: bill_length_mm (numeric)
## Explanatory: species (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 411.
The 𝐹 statistic is large: about 411. We need to compare this observed 𝐹 statistic to a null distribution that supposes there is no relationship between species and bill length, to see how likely such a value would be in that world.
null_anova <-
penguins %>%
specify(bill_length_mm ~ species) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "F")
null_anova %>%
visualize()
null_anova %>%
visualize() +
shade_p_value(f_stat, direction = "greater")
null_anova %>%
get_p_value(obs_stat = f_stat,
direction = "greater")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
We can see from the visualization and the p-value that there is essentially no chance of seeing the observed test statistic in a world where there is no relationship between species and bill length (with 1,000 permutations, p < 1/1000). We can safely infer that the differences in bill length among species are most likely not due to randomness.
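As with the median test above, a classical one-way ANOVA via aov() offers an informal cross-check; its F statistic should match the observed value computed with infer:

```r
# Classical one-way ANOVA on the same relationship; rows with missing
# bill lengths are dropped automatically (na.action = na.omit):
aov(bill_length_mm ~ species, data = penguins) %>%
  broom::tidy()
```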
Explore the following dataset from the General Social Survey:
gss
## # A tibble: 500 × 11
## year age sex college partyid hompop hours income class finrela weight
## <dbl> <dbl> <fct> <fct> <fct> <dbl> <dbl> <ord> <fct> <fct> <dbl>
## 1 2014 36 male degree ind 3 50 $2500… midd… below a… 0.896
## 2 1994 34 female no degr… rep 4 31 $2000… work… below a… 1.08
## 3 1998 24 male degree ind 1 40 $2500… work… below a… 0.550
## 4 1996 42 male no degr… ind 4 40 $2500… work… above a… 1.09
## 5 1994 31 male degree rep 2 40 $2500… midd… above a… 1.08
## 6 1996 32 female no degr… rep 4 53 $2500… midd… average 1.09
## 7 1990 48 female no degr… dem 2 32 $2500… work… below a… 1.06
## 8 2016 36 female degree ind 1 20 $2500… midd… above a… 0.478
## 9 2000 30 female degree rep 5 40 $2500… midd… average 1.10
## 10 1998 33 female no degr… dem 2 40 $1500… work… far bel… 0.550
## # … with 490 more rows
See ?gss for more information on the dataset.
After performing an exploratory data analysis, answer the following:
Do Americans work the same number of hours a week regardless of whether they have a college degree or not?
Is there an association between age and political party affiliation in the United States?
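For the first question, one possible starting point is the sketch below. It assumes the college factor has levels "degree" and "no degree", as in infer's gss data; adjust the order argument if your levels differ:

```r
# Observed difference in mean weekly hours between degree holders and others:
obs_diff <- gss %>%
  specify(hours ~ college) %>%
  calculate(stat = "diff in means", order = c("degree", "no degree"))

# Null distribution by permutation, then a two-tailed p-value:
gss %>%
  specify(hours ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("degree", "no degree")) %>%
  get_p_value(obs_stat = obs_diff, direction = "both")
```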
As seen in lecture, the point of linear regression is to determine the relationship between an outcome variable and its explanatory/predictor variable(s). When there is only one explanatory/predictor variable, this is called simple linear regression.
We’ll explore the Ames housing data set (De Cock 2011), which contains data on 2,930 properties in Ames, Iowa. It’s part of tidymodels. If you don’t have tidymodels installed:
install.packages("tidymodels")
data(ames, package = "modeldata")
glimpse(ames)
## Rows: 2,930
## Columns: 74
## $ MS_SubClass <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
## $ MS_Zoning <fct> Residential_Low_Density, Residential_High_Density, …
## $ Lot_Frontage <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
## $ Lot_Area <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
## $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
## $ Alley <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
## $ Lot_Shape <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
## $ Land_Contour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
## $ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
## $ Lot_Config <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
## $ Land_Slope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ Neighborhood <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
## $ Condition_1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
## $ Condition_2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
## $ Bldg_Type <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
## $ House_Style <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
## $ Overall_Cond <fct> Average, Above_Average, Above_Average, Average, Ave…
## $ Year_Built <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
## $ Year_Remod_Add <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
## $ Roof_Style <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
## $ Roof_Matl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
## $ Exterior_1st <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Exterior_2nd <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Mas_Vnr_Type <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
## $ Mas_Vnr_Area <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
## $ Exter_Cond <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Foundation <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
## $ Bsmt_Cond <fct> Good, Typical, Typical, Typical, Typical, Typical, …
## $ Bsmt_Exposure <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
## $ BsmtFin_Type_1 <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
## $ BsmtFin_SF_1 <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
## $ BsmtFin_Type_2 <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
## $ BsmtFin_SF_2 <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
## $ Bsmt_Unf_SF <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
## $ Total_Bsmt_SF <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
## $ Heating_QC <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
## $ Central_Air <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
## $ First_Flr_SF <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
## $ Second_Flr_SF <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
## $ Gr_Liv_Area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
## $ Bsmt_Full_Bath <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
## $ Bsmt_Half_Bath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Full_Bath <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
## $ Half_Bath <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ Bedroom_AbvGr <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
## $ Kitchen_AbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotRms_AbvGrd <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
## $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
## $ Fireplaces <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
## $ Garage_Type <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
## $ Garage_Finish <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
## $ Garage_Cars <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
## $ Garage_Area <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
## $ Garage_Cond <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Paved_Drive <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
## $ Wood_Deck_SF <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
## $ Open_Porch_SF <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
## $ Enclosed_Porch <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Screen_Porch <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
## $ Pool_Area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Pool_QC <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
## $ Fence <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
## $ Misc_Feature <fct> None, None, Gar2, None, None, None, None, None, Non…
## $ Misc_Val <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
## $ Mo_Sold <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
## $ Year_Sold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
## $ Sale_Type <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
## $ Sale_Condition <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
## $ Sale_Price <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
## $ Longitude <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638…
## $ Latitude <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4…
options(scipen = 999)
ames %>%
ggplot(aes(x = Sale_Price)) +
geom_histogram(bins = 50, color = "white", fill = "lightpink") +
theme_minimal()
Given the right skew of the data, sale price should be log-transformed:
ames %>%
ggplot(aes(x = Sale_Price)) +
geom_histogram(bins = 50, color = "white", fill = "lightpink") +
scale_x_log10() +
theme_minimal()
ames %>%
ggplot(aes(x = Gr_Liv_Area)) +
geom_histogram(bins = 50, color = "white", fill = "skyblue") +
# scale_x_log10() +
theme_minimal()
ames %>%
ggplot(aes(x = Gr_Liv_Area)) +
geom_histogram(bins = 50, color = "white", fill = "skyblue") +
scale_x_log10() +
theme_minimal()
ames <-
ames %>%
mutate(Sale_Price = log10(Sale_Price),
Gr_Liv_Area = log10(Gr_Liv_Area))
Visualize the relationship between sale price and:
{tidymodels} is the group of R packages inside the tidyverse that focus on the modeling process. These packages:
share an underlying philosophy, grammar, and data structures
follow tidyverse principles: design for humans, reuse of data structures, functional programming with the pipe operator %>%
have a consistent API
are designed individually to do different things, but also to work together
Motivation behind the creation of tidymodels:
Streamline the modeling process from end to end
Produce high-quality models
Encourage good methodology and statistical practice
Why is it important to be tidy? A user interface that fits all users’ needs (not just the makers’), and one that is flexible and can evolve over time.
### Using tidymodels
For our first model, we’ll look into the relationship between sale price and living area.
The first step in a tidymodels workflow is to specify the model. Here that is linear regression, declared with parsnip::linear_reg() and configured with parsnip::set_mode() and parsnip::set_engine():
set.seed(2021)
library(tidymodels)
lm_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
lm_spec
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
Once we have the specification, we can fit it by supplying a formula expression and the data we want to fit the model on. The formula is written in the form y ~ x, where y is the name of the response and x is the name of the predictor. The names used in the formula should match the names of the variables in the data set passed to data.
lm_fit <-
lm_spec %>%
fit(Sale_Price ~ Gr_Liv_Area, data = ames)
lm_fit
## parsnip model object
##
## Fit time: 1ms
##
## Call:
## stats::lm(formula = Sale_Price ~ Gr_Liv_Area, data = data)
##
## Coefficients:
## (Intercept) Gr_Liv_Area
## 2.3583 0.9078
lm_fit %>%
pluck("fit") %>%
summary()
##
## Call:
## stats::lm(formula = Sale_Price ~ Gr_Liv_Area, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90237 -0.06361 0.01147 0.07558 0.37360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.35830 0.05057 46.63 <0.0000000000000002 ***
## Gr_Liv_Area 0.90781 0.01602 56.66 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1223 on 2928 degrees of freedom
## Multiple R-squared: 0.523, Adjusted R-squared: 0.5228
## F-statistic: 3210 on 1 and 2928 DF, p-value: < 0.00000000000000022
The summary() output is not tidy, so we’ll use broom::tidy() to get the key parameter information:
tidy(lm_fit)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.36 0.0506 46.6 0
## 2 Gr_Liv_Area 0.908 0.0160 56.7 0
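tidy() can also return confidence intervals for the coefficients; the conf.int argument here is passed through to broom (an assumption worth verifying against your broom/parsnip versions):

```r
# Request 95% confidence intervals alongside the estimates:
tidy(lm_fit, conf.int = TRUE)
```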
broom::glance() can be used to extract the model statistics:
glance(lm_fit)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.523 0.523 0.122 3210. 0 1 2001. -3996. -3978.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
If you’re satisfied with the model fit, you can use predict() to generate predictions:
predict(lm_fit, new_data = ames)
## # A tibble: 2,930 × 1
## .pred
## <dbl>
## 1 5.28
## 2 5.04
## 3 5.19
## 4 5.38
## 5 5.27
## 6 5.27
## 7 5.20
## 8 5.18
## 9 5.27
## 10 5.31
## # … with 2,920 more rows
Setting type = "conf_int" returns a 95% confidence interval:
predict(lm_fit, new_data = ames, type = "conf_int")
## # A tibble: 2,930 × 2
## .pred_lower .pred_upper
## <dbl> <dbl>
## 1 5.28 5.29
## 2 5.03 5.05
## 3 5.19 5.20
## 4 5.37 5.38
## 5 5.27 5.28
## 6 5.26 5.27
## 7 5.19 5.20
## 8 5.17 5.18
## 9 5.27 5.28
## 10 5.31 5.32
## # … with 2,920 more rows
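Prediction intervals, which are wider because they also account for observation-level noise, are available the same way with type = "pred_int":

```r
# 95% prediction intervals for each observation:
predict(lm_fit, new_data = ames, type = "pred_int")
```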
We can use bind_cols() or tidymodels’ augment() to obtain the observed sale prices and the predicted sale prices:
pred_obs <-
augment(lm_fit, new_data = ames) %>%
select(Sale_Price, .pred)
pred_obs
## # A tibble: 2,930 × 2
## Sale_Price .pred
## <dbl> <dbl>
## 1 5.33 5.28
## 2 5.02 5.04
## 3 5.24 5.19
## 4 5.39 5.38
## 5 5.28 5.27
## 6 5.29 5.27
## 7 5.33 5.20
## 8 5.28 5.18
## 9 5.37 5.27
## 10 5.28 5.31
## # … with 2,920 more rows
Notice how the output is always a tibble. This makes results very flexible for more processing and plotting.
ggplot(pred_obs, aes(x = Sale_Price, y = .pred)) +
# Create a diagonal line:
geom_abline(lty = 2) +
geom_point(alpha = 0.5) +
labs(y = "Predicted Sale Price (log10)", x = "Sale Price (log10)") +
# Scale and size the x- and y-axis uniformly:
coord_obs_pred()
We can obtain several performance metrics at once using yardstick::metric_set():
ames_metrics <- metric_set(rmse, rsq, mae)
ames_metrics(pred_obs, truth = Sale_Price, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.122
## 2 rsq standard 0.523
## 3 mae standard 0.0917
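Note that because Sale_Price was log10-transformed earlier, these metrics are on the log10 scale. To report error in dollars, back-transform both columns first (a sketch):

```r
# Undo the log10 transform, then recompute RMSE on the dollar scale:
pred_obs %>%
  mutate(across(c(Sale_Price, .pred), ~ 10^.x)) %>%
  rmse(truth = Sale_Price, estimate = .pred)
```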
Build a model using another predictor from the Ames dataset.